
Genome Research

Cold Spring Harbor Laboratory

Preprints posted in the last 90 days, ranked by how well they match Genome Research's content profile, based on 409 papers previously published here. The average preprint has a 0.15% match score for this journal, so anything above that is already an above-average fit.

1
10-minimizers: a promising class of constant-space minimizers

Shur, A.; Tziony, I.; Orenstein, Y.

2026-03-18 bioinformatics 10.64898/2026.03.16.712052 medRxiv
Top 0.1%
25.8%

Minimizers are sampling schemes that are ubiquitous in almost all high-throughput sequencing analyses. Assuming a fixed alphabet of size σ, a minimizer is defined by two positive integers k, w and a linear order ρ on k-mers. A sequence is processed by a sliding window algorithm that chooses in each window of length w + k - 1 its minimal k-mer with respect to ρ. A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite σ-ary sequence. Minimizers of smaller density are preferred as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. Recent studies developed methods to generate minimizers with optimal and near-optimal densities, but they require explicitly storing k-mer ranks in Ω(2k) space. While constant-space minimizers exist, and some of them are proven to be asymptotically optimal, no constant-space minimizer was proven to guarantee lower density compared to a random minimizer in the non-asymptotic regime, and many minimizer schemes suffer from long k-mer key-retrieval times due to complex computation. In this paper, we introduce 10-minimizers, which constitute a class of minimizers with promising properties. First, we prove that for every k > 1 and every w ≥ k - 2, a random 10-minimizer has, in expectation, lower density than a random minimizer. This is the first provable guarantee for a class of minimizers in the non-asymptotic regime. Second, we present spacers, which are particular 10-minimizers combining three desirable properties: they are constant-space, low-density, and have small k-mer key-retrieval time. In terms of density, spacers are competitive with the best known constant-space minimizers; in certain (k, w) regimes they achieve the lowest density among all known (not necessarily constant-space) minimizers.
Notably, we are the first to benchmark constant-space minimizers in the time spent for k-mer key retrieval, which is the most fundamental operation in many minimizers-based methods. Our empirical results show that spacers can retrieve k-mer keys in competitive time (a few seconds per genome-size sequence, which is less than required by random minimizers), for all practical values of k and w. We expect 10-minimizers to improve minimizers-based methods, especially those using large window sizes. We also propose the k-mer key-retrieval benchmark as a standard objective for any new minimizer scheme.
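
The sliding-window definition in this abstract can be illustrated with a minimal sketch. It implements a generic (k, w)-minimizer and an empirical density estimate, not the 10-minimizers or spacers of the paper. The function names, the tie-breaking rule (leftmost minimum), and the use of a seeded hash as a stand-in for a random order ρ are all assumptions made for illustration.

```python
import hashlib


def hashed_rank(seed=0):
    """Return a rank function on k-mers; a seeded hash stands in for a random order (assumption)."""
    def rank(kmer):
        digest = hashlib.blake2b(f"{seed}:{kmer}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return rank


def minimizer_positions(seq, k, w, rank):
    """Start positions of the k-mers chosen by a (k, w)-minimizer with the given order."""
    n_kmers = len(seq) - k + 1
    chosen = set()
    # Each window of length w + k - 1 contains exactly w consecutive k-mers.
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        # Pick the minimal k-mer; ties are broken by the leftmost position (assumption).
        best = min(window, key=lambda i: (rank(seq[i:i + k]), i))
        chosen.add(best)
    return sorted(chosen)


def density(seq, k, w, rank):
    """Empirical density: fraction of k-mer positions that are chosen."""
    n_kmers = len(seq) - k + 1
    return len(minimizer_positions(seq, k, w, rank)) / n_kmers


if __name__ == "__main__":
    # Lexicographic order on a toy sequence.
    print(minimizer_positions("ACGTACGT", 2, 3, lambda kmer: kmer))
    print(density("ACGTACGT", 2, 3, hashed_rank(seed=1)))
```

Density on a short finite string is only a rough proxy for the expected density over a random infinite sequence that the abstract refers to.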

2
A multi-flow approach for binning circular plasmids from short-reads assembly graphs

Epain, V.; Mane, A.; Della Vedova, G.; Bonizzoni, P.; Chauve, C.

2026-03-26 genomics 10.64898/2026.03.25.714305 medRxiv
Top 0.1%
19.4%

We address the problem of plasmid binning, which aims to group contigs from a draft short-read assembly of a bacterial sample into bins, each expected to correspond to a plasmid present in the sequenced bacterial genome. We formulate the plasmid binning problem as a network multi-flow problem in the assembly graph and describe a Mixed-Integer Linear Program to solve it. We compare our new method, PlasBin-HMF, with state-of-the-art methods, MOB-recon, gplasCC, and PlasBin-flow, on a dataset of more than 500 bacterial samples, and show that PlasBin-HMF outperforms the other methods while preserving explainability.

3
Temporal Transcriptomics Identifies Isoform-specific Trans-regulation by Multiple lncRNAs in Human iPSCs

Liu, M.; Mamede, I.; Sofi, S.; Pereira, I.; Dostal, V.; Pashos, A. R. S.; McMahon, C.; Waikar, A.; Stephenson, G.; Cech, T. R.; Rinn, J. L.

2026-05-14 genomics 10.64898/2026.05.13.724994 medRxiv
Top 0.1%
18.6%

Some long non-coding RNAs (lncRNAs) are known to regulate gene expression. However, the underlying temporal dynamics of lncRNAs influencing gene and epigenetic regulation and mechanisms of lncRNA regulation in trans are less understood. To investigate this, we genetically engineered 17 doxycycline-inducible lncRNA transgenes for ectopic expression at the H11 safe harbor locus in human pluripotent stem cells (hiPSCs), and we generated high-density temporal RNA-seq and ATAC-seq profiles. Most lncRNA transgenes were induced at 2 hours and maintained expression through the 96-hour time course. Surprisingly, when we sought to identify gene expression changes due to the lncRNAs, we found that the global transcriptional landscape was dominated by a strong systemic response triggered by doxycycline exposure. We rigorously defined this cohort of genes as a Doxycycline-Responsive Gene Signature (DRGS). The DRGS was also present in at least 28 public datasets from dox-inducible transgene studies involving diverse cell types. Next, we determined which lncRNAs exhibited trans-regulatory events. We identified DANCR, FENDRR, LINC00667, LINC00847, LNCPRESS1, and PNKY as lncRNAs that regulate specific transcript expression in trans. The downstream target genes encoded 53 mRNAs and 10 lncRNAs. None of the target lncRNAs altered gene expression proximal to their own loci (i.e., triggering secondary cis-effects). Surprisingly, the target genes of LINC00847 (transcribed from chromosome 22) were substantially enriched on chromosome 19, with a preponderance of target genes encoding RNA metabolism and RNA splicing factors. Collectively, our study provides a resource to discern artifacts in the doxycycline-inducible system and identifies temporally regulated targets of 6 lncRNAs for future mechanistic studies.

4
Robust data-driven gene expression inference for RNA-seq using curated intergenic regions

Brandulas Cammarata, A.; Fonseca Costa, S. S.; Rosikiewicz, M.; Roux, J.; Wollbrett, J.; Bastian, F. B.; Robinson-Rechavi, M.

2026-05-20 genomics 10.1101/2022.03.31.486555 medRxiv
Top 0.1%
18.4%

RNA-Seq is a powerful technique to provide quantitative information on gene expression. While many applications focus on measuring expression levels, accurately distinguishing between actively and inactively transcribed genes is equally important for understanding gene function, development, and disease mechanisms. However, setting a biologically meaningful threshold for calling genes expressed is challenging due to variability in noise levels across different protocols, experiments or biological samples. We propose to define this threshold per sample relative to the background level observed in inactive genomic features, inferred by the amount of reads mapped to intergenic regions of the genome, and to call genes expressed if their level of expression is significantly higher than the estimated background noise. This approach can be applied to a single RNA-Seq library as well as to a combination of libraries from the same condition, in model and non-model organisms. We show that our method yields a more accurate prediction of expression state than existing methods, illustrated by consistent expression calls for biological replicates in the same tissue.
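
The idea of calling a gene expressed when it lies significantly above a per-sample intergenic background can be sketched as follows. This is not the authors' procedure: the normal model for the background, the log-scale inputs, the threshold `alpha`, and the function name `call_expressed` are all assumptions chosen to keep the example small.

```python
from statistics import NormalDist, mean, stdev


def call_expressed(gene_log_expr, intergenic_log_expr, alpha=0.05):
    """Call genes expressed if they exceed the intergenic background.

    gene_log_expr: dict mapping gene name to log-scale expression in one sample.
    intergenic_log_expr: log-scale expression of intergenic regions in the same sample.
    Assumption: the background is modelled as a normal distribution.
    """
    background = NormalDist(mean(intergenic_log_expr), stdev(intergenic_log_expr))
    calls = {}
    for gene, value in gene_log_expr.items():
        # One-sided p-value: probability that background alone reaches this level.
        p_value = 1.0 - background.cdf(value)
        calls[gene] = p_value < alpha
    return calls


if __name__ == "__main__":
    intergenic = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0]
    genes = {"geneA": 5.0, "geneB": -0.5, "geneC": 1.0}
    print(call_expressed(genes, intergenic))
```

Because the background is estimated separately for every library, the same gene-level value can be called expressed in a low-noise sample and not expressed in a noisy one, which is the per-sample behaviour the abstract describes.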

5
New Space-Time Tradeoffs for Subset Rank and k-mer Lookup

Diseth, A. C.; Puglisi, S. J.

2026-03-18 bioinformatics 10.64898/2026.03.16.712042 medRxiv
Top 0.1%
17.7%

Given a sequence S of subsets of symbols drawn from an alphabet of size σ, a subset rank query srank(i, c) asks for the number of subsets before the ith subset that contain the symbol c. It was recently shown (Alanko et al., Proc. SIAM ACDA, 2023) that subset rank queries on the spectral Burrows-Wheeler transform (SBWT) lead to efficient k-mer lookup queries, an essential and widespread task in genomic sequence analysis. In this paper we design faster subset rank data structures that use small space: less than 3 bits per k-mer. Our experiments show that this translates to new Pareto optimal SBWT-based k-mer lookup structures at the low-memory end of the space-time spectrum.
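
To make the query concrete, here is a naive reference implementation of srank(i, c) using prefix counts. It stores one counter per symbol per position, so it is far larger than the compact structures the abstract describes and only fixes the semantics. The class name and the 0-based indexing convention (srank(i, c) counts subsets S[0], ..., S[i-1]) are assumptions.

```python
class NaiveSubsetRank:
    """Prefix-count table answering subset rank queries in constant time."""

    def __init__(self, subsets, alphabet):
        # counts[c][i] = number of subsets among the first i that contain c.
        self.counts = {c: [0] for c in alphabet}
        for subset in subsets:
            for c in alphabet:
                self.counts[c].append(self.counts[c][-1] + (c in subset))

    def srank(self, i, c):
        """Number of subsets before position i (0-based) that contain symbol c."""
        return self.counts[c][i]


if __name__ == "__main__":
    subsets = [{"A", "C"}, {"G"}, set(), {"A"}, {"A", "T"}]
    index = NaiveSubsetRank(subsets, "ACGT")
    print(index.srank(4, "A"))
```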

6
scRGCL: Neighbor-Aware Graph Contrastive Learning for Robust Single-Cell Clustering

Fan, J.; Liu, F.; Lai, X.

2026-03-18 bioinformatics 10.64898/2026.03.16.712039 medRxiv
Top 0.1%
17.1%

Accurate cell type identification is a fundamental step in single-cell RNA sequencing (scRNA-seq) data analysis, providing critical insights into cellular heterogeneity at high resolution. However, the high dimensionality, zero inflation, and long-tailed distribution of scRNA-seq data pose significant computational challenges for conventional clustering approaches. Although recent deep learning-based methods utilize contrastive learning to jointly learn representations and clustering assignments, they often overlook cluster-level information, leading to suboptimal feature extraction for downstream tasks. To address these limitations, we propose scRGCL, a single-cell clustering method that learns a regularized representation guided by contrastive learning. Specifically, scRGCL captures the cell-type-associated expression structure by clustering similar cells together while ensuring consistency. For each sample, the model performs negative sampling by selecting cells from distinct clusters, thereby ensuring semantic dissimilarity between the target cell and its negative pairs. Moreover, scRGCL introduces a neighbor-aware re-weighting strategy that increases the contribution of samples from clusters closely related to the target. This mechanism prevents cells from the same category from being mistakenly pushed apart, effectively preserving intra-cluster compactness. Extensive experiments on fourteen public datasets demonstrate that scRGCL consistently outperforms state-of-the-art methods, as evidenced by significant improvements in normalized mutual information (NMI) and adjusted Rand index (ARI). Moreover, ablation studies confirm that the integration of cluster-aware negative sampling and the neighbor-aware re-weighting module is essential for achieving high-fidelity clustering.
By harmonizing cell-level contrast with cluster-level guidance, scRGCL provides a robust and scalable framework that advances the precision of automated cell-type discovery in increasingly complex single-cell landscapes.

Key Messages
- scRGCL uses contrastive learning on a regularized representation for single-cell clustering.
- scRGCL outperforms four state-of-the-art methods on 15 datasets.
- scRGCL's cluster-aware negative sampling and neighbor-aware re-weighting modules are essential for high-fidelity single-cell clustering.

7
Hierarchical genomic feature annotation with variable-length queries

Alanko, J. N.; Ranallo-Benavidez, T. R.; Barthel, F. P.; Puglisi, S. J.; Marchet, C.

2026-03-18 bioinformatics 10.64898/2026.03.15.711907 medRxiv
Top 0.1%
17.1%

K-mer-based methods are widely used for sequence classification in metagenomics, pangenomics, and RNA-seq analysis, but existing tools face important limitations: they typically require a fixed k-mer length chosen at index construction time, handle multi-matching k-mers (whose origin in the indexed data is ambiguous) in ad-hoc ways, and some resort to lossy approximations, complicating interpretation. We present HKS, a data structure for exact hierarchical variable-length k-mer annotation. Building on the Spectral Burrows-Wheeler Transform (SBWT), a single HKS index is constructed for a specified maximum query length s, and supports queries at any length k ≤ s. HKS associates each k-mer with exactly one label from a user-defined category hierarchy, where multi-matching k-mers are resolved to their most specific common node in the hierarchy. We formalize a feature assignment framework that partitions indexed k-mers into disjoint sets according to a user-defined category hierarchy. To recover specificity lost to multi-matching and novel k-mers, we introduce a hierarchy-aware smoothing algorithm that makes use of flanking sequence context. We validate the approach by assigning each query k-mer to a specific chromosome across human genome assemblies, including the T2T-CHM13v2.0 reference as a positive control and two diploid genomes of different ancestries (HG002, NA19185). Smoothing increases overall concordance from ~81% to ~97%, with residual errors attributable to known biological phenomena including acrocentric short-arm recombination and subtelomeric duplications. In performance benchmarks against Kraken2, HKS provides comparable query throughput while providing exact, lossless annotation across all k-mer lengths simultaneously from a single index. A prototype implementation is available at https://github.com/jnalanko/HKS.
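
Resolving a multi-matching k-mer to its most specific common node amounts to a lowest-common-ancestor computation in the category hierarchy. The sketch below shows that step in isolation on a toy hierarchy. It does not reflect how HKS stores or queries labels; the `parent` dictionary representation, the function names, and the example category names are hypothetical.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def most_specific_common_node(labels, parent):
    """Deepest node that is an ancestor of (or equal to) every label."""
    labels = list(labels)
    # Candidates are ordered from most specific to least specific.
    common = path_to_root(labels[0], parent)
    for label in labels[1:]:
        ancestors = set(path_to_root(label, parent))
        common = [node for node in common if node in ancestors]
    return common[0]


if __name__ == "__main__":
    # Toy hierarchy (hypothetical): root -> {autosome -> {chr1, chr2}, sex -> {chrX}}
    parent = {
        "root": None,
        "autosome": "root",
        "sex": "root",
        "chr1": "autosome",
        "chr2": "autosome",
        "chrX": "sex",
    }
    print(most_specific_common_node(["chr1", "chr2"], parent))
```

A k-mer found on a single chromosome keeps that chromosome as its label, while a k-mer shared across branches is pushed up to the first node that covers all of its occurrences.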

8
Significantly Improved Mouse and Rat Genome Annotation Using Sequence Read Archive RNA-seq Data

Meng, F.; Turner, D. L.; Hagenauer, M. H.; Watson, S.; Akil, H.

2026-03-09 genomics 10.64898/2026.03.06.709975 medRxiv
Top 0.1%
14.9%

To detect currently unannotated genes with low expression levels with high sensitivity and accuracy, we developed a new exon->gene->transcript annotation pipeline that can identify previously undetected multi-exon transcripts using large volumes of RNA-Seq data. Our pipeline incorporates three new algorithms: 1) model-based spliced exon detection, 2) exon-to-gene assignment across multiple tissue/datasets through exon community discovery, and 3) ranking top transcripts by a stepwise minimum flow procedure. The design of our pipeline allowed us to leverage hundreds of Tbases of public RNA-seq data as input to improve mouse and rat genome annotation. Using this data, our pipeline identified close to 15K and 21K unannotated genes in GENCODE M37 and ENSEMBL 114 for mouse and rat, respectively. Each species also gained over 200K predicted transcripts containing at least one new exon, although most were transcripts from GENCODE/ENSEMBL annotated genes with newly assigned exons. To make our genome annotation available for common use, we have packaged this new annotation in standard file formats for the analysis of bulk and single cell RNA-seq data (GTF, 10X genome files). We have also provided two use examples which demonstrate the utility of our newly annotated genes in functional analyses, showing that their expression can be differentially regulated in relationship to cell type and selective breeding. Due to the efficiency provided by our pipeline, we expect that as new RNA-seq data become available in the coming years it will significantly benefit rat gene/transcript annotation, eventually enabling us to approach the target of complete gene and transcript annotation.

9
Cell type-specific gene regulatory network inference from single cell transcriptomics with ctOTVelo

Chang, S.; Zhao, W.; Ma, Y.; Sandstede, B.; Singh, R.

2026-03-14 genomics 10.64898/2026.03.11.711174 medRxiv
Top 0.1%
14.7%

Inferring gene regulatory networks (GRNs) from gene expression is a crucial task for understanding functional relationships. Gene expression data (transcriptomics) provide a snapshot of gene activity, encoding information about gene regulatory relationships. However, gene regulation is a dynamic process, modulating across time and with different cell types. Temporal GRN inference methods aim to capture these dynamics by utilizing time-stamped transcriptomics, gene expression data of similar samples captured across discrete timepoints, or pseudotime transcriptomics, computationally ordering cells based on an inferred trajectory. These methods can estimate constant or temporal gene regulatory relationships, but may not capture finer, cell type specific relationships. We propose ctOTVelo, an extension to our previous work to account for cell type specificity during GRN inference. ctOTVelo incorporates cell type labels or proportions when inferring the GRN from single cell transcriptomics data. Our methods achieve state-of-the-art performance in GRN prediction in time-stamped and pseudotime-stamped transcriptomics. Furthermore, ctOTVelo is able to generate cell type specific GRNs, allowing cell type resolution analysis of gene regulatory relationships.

10
NanoLabel: A fast and accurate real-time nanopore signal classifier

Mahajan, D.; Jain, C.; Kashyap, N.

2026-05-06 genomics 10.64898/2026.05.03.722500 medRxiv
Top 0.1%
14.5%

Oxford Nanopore Technologies' adaptive sampling capability promises to reduce sequencing cost and turnaround time. At its core, adaptive sampling is a real-time classification problem that distinguishes reads originating from regions of interest. Direct signal-based classification approaches bypass the computational bottleneck of basecalling and can eliminate the need for powerful GPUs. However, operating directly on noisy raw signals remains challenging in real-time settings, where classification decisions must be made quickly. In this work, we propose NanoLabel, a new method for real-time classification of nanopore signals. We build NanoLabel on top of the signal-based read mapping tool RawHash2. We accelerate the classification workflow by mapping reads using only the target regions as the reference. To further improve accuracy, we train a lightweight classifier on mapping-derived features and introduce a data augmentation strategy to construct sufficiently large and class-balanced training datasets. We evaluate NanoLabel using publicly available real sequencing datasets from three human genomes (HG001, HG002, and HG005), while assuming a cancer gene panel as the target. Compared to directly mapping reads with RawHash2, we demonstrate an 80x improvement in classification time and a 0.10-0.25 unit improvement in the F1 score.

11
PerturbPlan: An analytical framework for designing Perturb-seq experiments

Niu, Z.; He, Y.; Galante, J.; Gschwind, A. R.; Ray, J.; Steinmetz, L. M.; Engreitz, J. M.; Katsevich, E.

2026-05-23 genomics 10.64898/2026.05.22.727199 medRxiv
Top 0.1%
14.2%

CRISPR screens with single-cell RNA-seq readouts provide a powerful tool for characterizing the functions of noncoding elements and genes. However, designing these experiments to balance statistical power and cost is challenging, given the large number of design parameters. The only available tool for this purpose is a simulation-based power calculator, but it is computationally costly and requires high-performance computing to run. We derive a novel analytical formula for the power to detect perturbation-expression associations, recapitulating power estimates from the simulation-based tool while reducing runtime by up to seven orders of magnitude. This acceleration unlocks the possibility of interactive single-cell CRISPR screen design. Accordingly, we develop PerturbPlan, an interactive web application built on the analytical power formula. PerturbPlan helps users address 11 design questions for two types of single-cell CRISPR screens, Perturb-seq and targeted Perturb-seq (TAP-seq). We apply PerturbPlan to carry out a comparative analysis of three recent Perturb-seq designs, demonstrating how optimal design varies across experiments of different scales. We also use PerturbPlan to quantify the cost savings of a recent TAP-seq study relative to a hypothetical Perturb-seq study assaying the same perturbations, illustrating how the tool can inform decisions about targeted versus whole-transcriptome readouts. In sum, PerturbPlan is the first tool to facilitate flexible and interactive design of well-powered single-cell CRISPR screen experiments.

12
A run-length-compressed skiplist data structure for dynamic GBWTs supports time and space efficient pangenome operations over syncmers

Durbin, R.

2026-03-29 bioinformatics 10.64898/2026.03.26.714584 medRxiv
Top 0.1%
14.0%

Skiplists (Pugh, 1990) are probabilistic data structures over ordered lists supporting O(log N) insertion and search, which share many properties with balanced binary trees. Previously we introduced the graph Burrows-Wheeler transform (GBWT) to support efficient search over pangenome path sets, but current implementations are static and cumbersome to build and use. Here we introduce two doubly-linked skiplist variants over run-length-compressed BWTs that support O(log N) rank, access and insert operations. We use these to store and search over paths through a syncmer graph built from Edgar's closed syncmers, equivalent to a sparse de Bruijn graph. Code is available in rskip.[ch] within the syng package at github.com/richarddurbin/syng. This builds a 5.8 GB lossless GBWT representation of 92 full human genomes (280 Gbp including all centromeres and other repeats) single-threaded in 52 minutes, on top of a 4 GB 63 bp syncmer set built in 37 minutes. Arbitrarily long maximal exact matches (MEMs) can then be found as seeds for sequence matches to the graph at a search rate of approximately 1 Gbp per 10 seconds per thread.
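
For readers unfamiliar with closed syncmers, the selection rule can be sketched in a few lines: a k-mer is a closed syncmer when its smallest s-mer sits at the first or last position. The sketch below is independent of the syng code; the lexicographic order on s-mers and the toy parameters are assumptions, and practical implementations typically use a hashed order.

```python
def closed_syncmers(seq, k, s):
    """Start positions of k-mers whose minimal s-mer is their first or last s-mer."""
    positions = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        smallest = min(smers)  # lexicographic order (assumption)
        # Ties count: the k-mer qualifies if either end attains the minimum.
        if smers[0] == smallest or smers[-1] == smallest:
            positions.append(i)
    return positions


if __name__ == "__main__":
    print(closed_syncmers("GACTAGCT", k=4, s=2))
```

Unlike a window minimizer, the decision for each k-mer depends only on that k-mer, so the same k-mer is selected or rejected regardless of its neighbours in the sequence.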

13
NovaClone: A Network-Based Algorithm for Clonal and Subclonal Genotyping of Barcoded Transgene Integrations

Prillo, S.; Rimini, D.; Olivares-Chauvet, P.; Song, Y. S.; Yosef, N.

2026-05-13 genomics 10.64898/2026.05.11.724244 medRxiv
Top 0.1%
14.0%

Single-cell lineage tracing technologies are providing new and powerful ways to interrogate the evolution and divergence of cell populations in cancer, development, and other contexts. A key initial step in any such analysis is the grouping of cells into clonal populations, based on clone-level marks. Unfortunately, clone calling is prone to technical effects due to sequencing errors, missing data, multiplets, background noise, and accidental sharing of clonal barcodes between unrelated clones (homoplasy). We present NovaClone, a principled algorithm for hierarchical clone calling that is broadly applicable to all current tracing technologies, including both static barcoding and the more recent evolving tracers. We benchmark NovaClone on simulated and real data to show that it outperforms the current solutions in terms of both quality and speed, thereby helping to mitigate one of the most prevalent problems with single-cell lineage tracing. To complement NovaClone, we introduce a suite of algorithm-agnostic quality control metrics to evaluate clone calls when ground truth is not available. NovaClone and the associated QCs are available through the open source Python package nova-clone.

14
scPlOver: inferring DNA content from amplification-free single-cell WGS using fragment overlaps

Myers, M. A.; Satas, G.; Shah, S.; Mcpherson, A.

2026-05-10 genomics 10.64898/2026.05.06.722337 medRxiv
Top 0.1%
13.9%

Correctly inferring copy-number aberrations from single-cell DNA sequencing data requires estimating cellular DNA content, which is unidentifiable from read counts alone. In tagmentation-based sequencing, each fragment represents a distinct DNA molecule, thus fragment overlaps provide an orthogonal signal for copy number. We present a theoretical model of fragment overlaps as a function of copy number and coverage and introduce scPlOver, a method that uses this model to infer DNA content. scPlOver outperforms existing approaches on simulated and experimental datasets and identifies thousands of ovarian cancer cells with higher DNA content than previously estimated across a cohort of 41 patients.

15
GenePT Revisited: Do Better Text Embeddings Make Better Gene Embeddings?

Hedley, J. G.; Torr, P. H. S.; Märtens, K.

2026-04-20 genomics 10.64898/2026.04.16.718976 medRxiv
Top 0.1%
12.6%

GenePT introduced a simple recipe for gene representations: embed each gene's natural-language description with a general-purpose text embedding model and reuse the resulting vectors across downstream tasks. Since GenePT's release, embedding models have improved rapidly, with many strong open and commercial encoders benchmarked on suites such as the Massive Text Embedding Benchmark (MTEB). We present a controlled "leaderboard" study that keeps the GenePT pipeline fixed and varies only the embedding backbone. We benchmark contemporary encoders on four diverse gene embedding tasks: gene-gene interaction prediction, gene property classification, cell type classification, and prediction of transcriptomic responses to unseen genetic perturbations. Across these settings, newer backbones consistently outperform the original GenePT backbone (text-embedding-ada-002), achieving improvements of 1-17%, while enabling fully reproducible research by avoiding API dependencies.

16
Pareto optimization of masked superstrings improves compression of pan-genome k-mer sets

Plachy, J.; Sladky, O.; Brinda, K.; Vesely, P.

2026-03-20 bioinformatics 10.64898/2026.03.18.712440 medRxiv
Top 0.1%
12.5%

The growing interest in k-mer-based methods across bioinformatics calls for compact k-mer set representations that can be optimized for specific downstream applications. Recently, masked superstrings have provided such flexibility by moving beyond de Bruijn graph paths to general k-mer superstrings equipped with a binary mask, thereby subsuming Spectrum-Preserving String Sets and achieving compactness on arbitrary k-mer sets. However, existing methods optimize superstring length and mask properties in two separate steps, possibly missing solutions where a small increase in superstring length yields a substantial reduction in mask complexity. Here, we introduce the first method for Pareto optimization of k-mer superstrings and masks, and apply it to the problem of compressing pan-genome k-mer sets. We model the compressibility of masked superstrings using an objective that combines superstring length and the number of runs in the mask. We prove that the resulting optimization problem is NP-hard and develop a heuristic based on iterative deepening search in the Aho-Corasick automaton. Using microbial pan-genome datasets, we characterize the Pareto front in the superstring-length/mask-run space and show that the front contains points that Pareto-dominate simplitigs and matchtigs, while nearly encompassing the previously studied greedy masked superstrings. Finally, we demonstrate that Pareto-optimized masked superstrings improve pan-genome k-mer set compressibility by 12-19% when combined with neural-network compressors.
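
The two quantities traded off in this abstract, superstring length and the number of mask runs, can be computed with a short sketch. It assumes the usual masked-superstring convention that a mask bit of 1 at position i marks the k-mer starting at i as part of the set. Counting runs as maximal blocks of equal bits and combining the two terms with a weight `run_weight` are illustrative assumptions, not the objective used in the paper.

```python
def represented_kmers(superstring, mask, k):
    """Set of k-mers switched on by the mask (bit 1 at position i selects superstring[i:i+k])."""
    return {
        superstring[i:i + k]
        for i, bit in enumerate(mask)
        if bit == "1" and i + k <= len(superstring)
    }


def mask_runs(mask):
    """Number of maximal runs of identical bits in the mask (assumed definition of a run)."""
    return sum(1 for i, bit in enumerate(mask) if i == 0 or bit != mask[i - 1])


def objective(superstring, mask, run_weight=1.0):
    """Illustrative cost: superstring length plus a weighted count of mask runs."""
    return len(superstring) + run_weight * mask_runs(mask)


if __name__ == "__main__":
    print(represented_kmers("ACGTA", "11000", 3))
    print(objective("ACGTA", "11000"))
```

Two masked superstrings can encode the same k-mer set while differing in both terms, which is why the paper studies the whole Pareto front rather than a single optimum.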

17
Sample barcoding-associated technical variation in probe-based single-cell RNA sequencing

Weir, J. A.; Krebs, Y.; Chen, F.

2026-04-08 genomics 10.64898/2026.04.06.716804 medRxiv
Top 0.1%
12.4%

Probe-based single cell RNA sequencing approaches are increasingly becoming a technology of choice for profiling gene expression at scale and in archival tissues. The 10x Genomics Flex v1 assay enables cost-effective and high-sensitivity single-cell RNA sequencing by splitting samples across up to 16 uniquely barcoded probe sets before pooling and loading onto a single lane of a microfluidic chip. A natural consequence of this design is to leverage probe set barcoding as a sample barcoding strategy for case-control experiments. However, we observed that Flex v1 probe set barcode identity drives substantial technical variation between probe set barcodes, an effect that is reproducible across lanes and independent datasets. When Flex v1 probe set barcodes are confounded with biological sample identity, a concerning number of differentially expressed genes at standard thresholds are false positives. The Flex v2 assay, which decouples sample barcoding from probe set hybridization, significantly reduces this artifact. As the field continues to expand adoption of probe-based assays, our findings introduce probe set barcoding as an underappreciated source of technical variation in single-cell assays and emphasize the importance of experimental design when using probe-based sequencing technologies.

18
mChIP-seq for Multiplex and Multifactorial Epigenomic Profiling Uncovers Cancer-specific Histone Features in Cellular and Circulating Nucleosomes

Sun, C.; Zhang, Q.; Yan, J.; Wang, X.; Zhang, C.; Li, Y.; Li, J.; Xu, W.

2026-04-29 genomics 10.64898/2026.04.27.721226 medRxiv
Top 0.1%
12.3%

Epigenomic profiling makes it possible to investigate the regulatory roles of histone marks in specific cell types, and serves as a critical path for discovering noninvasive epigenetic models in cell-free nucleosomes. Here, we present mChIP-seq, an epigenomic profiling technology that is compatible with both cell and cell-free samples for synchronously profiling multifactorial epigenetic landscapes on multiple samples. Combining sample indexing in a single reaction with a pool-and-split strategy for immunoprecipitation, mChIP-seq enhances efficiency and reduces cost. Using mChIP-seq, we profiled H2A.Z and 10 histone modifications in cell lines representing 9 cancer types. Integrative analyses further revealed an atypical association of H2A.Z and H3K4me3 at promoter regions in cancer. Based on mChIP-seq, we developed cf-mChIP-seq for circulating nucleosomes, which requires as little as 25 µl of plasma per profile. Profiling 38 plasma samples for H2A.Z, H3K4me3, H3K27ac, and H3K9me3 with cf-mChIP-seq revealed distinct histone mark-associated cfDNA fragment patterns in breast cancer versus healthy controls, highlighting the potential of cf-mChIP-seq to expand liquid biopsy methodologies. These results demonstrate that mChIP-seq is a widely applicable technology for large-scale epigenomic profiling of nucleosomes in cellular or cell-free forms.

19
GraphHDBSCAN*: Graph-based Hierarchical Clustering on High Dimensional Single-cell RNA Sequencing Data

Ghoreishi, S. A.; Szmigiel, A. W.; Nagai, J. S.; Gesteira Costa Filho, I.; Zimek, A.; Campello, R. J. G. B.

2026-03-26 bioinformatics 10.64898/2026.03.24.713924 medRxiv
Top 0.1%
12.3%

Single-cell RNA sequencing (scRNA-seq) is widely used to resolve cellular heterogeneity across thousands to millions of cells. A major challenge is to identify biologically meaningful cell populations while preserving their hierarchical organization, because broad cell types frequently split into more specialized subtypes. However, state-of-the-art approaches mostly focus on flat partitions and ignore the hierarchical structure of single-cell data. Here we introduce GraphHDBSCAN*, a graph-based, hyperparameter-free extension of HDBSCAN* that performs hierarchical density-based clustering on a graph representation of the data, enabling robust recovery of both single-level and hierarchical relationships in high-dimensional and sparse datasets. We evaluate GraphHDBSCAN* across multiple scRNA-seq datasets and show that it recovers biologically meaningful hierarchies that reveal fine-grained structure in complex data, including monocyte subpopulations. In addition, the method yields high-quality flat partitions that outperform widely used community-detection methods.

20
TCRseek: Scalable Approximate Nearest Neighbor Search for T-Cell Receptor Repertoires via Windowed k-mer Embeddings

Yang, Y.

2026-03-24 bioinformatics 10.64898/2026.03.20.713313 medRxiv
Top 0.1%
12.2%

The rapid growth of T-cell receptor (TCR) sequencing data has created an urgent need for computational methods that can efficiently search CDR3 sequences at scale. Existing approaches either rely on exact pairwise distance computation, which scales quadratically with repertoire size, or employ heuristic grouping that sacrifices sensitivity. Here we present TCRseek, a two-stage retrieval framework that combines biologically informed sequence embeddings with approximate nearest neighbor (ANN) indexing for scalable search over TCR repertoires. TCRseek first encodes CDR3 amino acid sequences into fixed-length numerical vectors through a multi-scale windowed k-mer embedding scheme derived from BLOSUM62 eigendecomposition, then indexes these vectors using FAISS-based structures (IVF-Flat, IVF-PQ, or HNSW-Flat) that support sublinear-time search. A second-stage reranking module refines the shortlisted candidates using exact sequence alignment scores (Needleman-Wunsch with BLOSUM62), Levenshtein distance, or Hamming distance. We benchmarked TCRseek against tcrdist3, TCRMatch, and GIANA on a 100,000-sequence corpus with precomputed exact ground truth under three distance metrics. Under cross-metric evaluation (where the reranking and ground truth metrics differ, providing the most informative test of generalization), TCRseek achieved NDCG@10 = 0.890 (Levenshtein ground truth) and 0.880 (Hamming ground truth), ranking highest among the retained baselines under Hamming and remaining competitive with tcrdist3 (0.894) under Levenshtein. When the reranking metric matches the ground truth definition (BLOSUM62 alignment), NDCG@10 reached 0.993, confirming that the ANN shortlist captures >99% of true neighbors, the expected ceiling of the two-stage design. On the 100,000-sequence corpus, TCRseek achieved 3.6-39.6x speedup over exact brute-force search depending on index type and distance metric, with the largest gains for alignment-based retrieval.
These results demonstrate that embedding-based ANN search provides a practical and scalable alternative for TCR repertoire analysis.
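
The second-stage reranking described in this abstract can be sketched with a plain Levenshtein distance. The sketch covers only the reranking of an already shortlisted candidate set; the embedding and FAISS indexing stages are omitted. The function names, the alphabetical tie-break, and the example CDR3 strings are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (unit-cost insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]


def rerank(query, shortlist, top_k=10):
    """Order shortlisted candidates by exact distance to the query and keep the top_k."""
    ranked = sorted(shortlist, key=lambda cand: (levenshtein(query, cand), cand))
    return ranked[:top_k]


if __name__ == "__main__":
    shortlist = ["CAWSVGNTF", "CASRPGQYF", "CASSLGAYF", "CASSLGQYF"]
    print(rerank("CASSLGQYF", shortlist, top_k=2))
```

Exact distances are computed only for the shortlist, so the quadratic cost of pairwise comparison is paid on a small candidate set rather than on the whole repertoire.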